Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added a BallistaContext to ballista to allow for Remote or standalone #1100

Merged
merged 36 commits into from
Nov 19, 2024

Conversation

tbar4
Copy link
Contributor

@tbar4 tbar4 commented Oct 28, 2024

Which issue does this PR close?

Closes #1091

Rationale for this change

This adds a standalone context option when initializing a BallistaContext in Ballista Python

What changes are included in this PR?

added a context.py file to the python directory and also added new functionality to the BallistaContext file in python directory (context.rs)

Are there any user-facing changes?

This update should be backwards compatible and will default users to a standalone instance, but I think we should discuss if we should go with the original idea of having the Ballista Context initialization in Python for standalone be it's own module.

@tbar4 tbar4 marked this pull request as ready for review October 31, 2024 21:06
@tbar4 tbar4 marked this pull request as draft October 31, 2024 22:13
@tbar4 tbar4 marked this pull request as ready for review November 1, 2024 17:30
@tbar4
Copy link
Contributor Author

tbar4 commented Nov 1, 2024

The main purpose of this Pull Request is to update the Ballista Context and not necessarily to add all the functionality on top of the BallistaContext class found in DataFusion Python. I hope to get it on par with Datafusion Python in follow up MRs, this should be pretty easy because the class is calling DataFusion Python, so I am assuming I should be able to implement a Mixin that would allow me to get it to parity. Any advice on that is appreciated!

@milenkovicm
Copy link
Contributor

@tbar4 is there any reason to keep BallistaContext and proxy all methods to SeasionContext ? Why we need to duplicate code? Idea is to decrease code we need to maintain. Can't we use SeasionContext directly?

@milenkovicm
Copy link
Contributor

Please have a look at #1091, the main idea is to utilise PySessionContext directly, same effort we're doing with rust interface

@tbar4
Copy link
Contributor Author

tbar4 commented Nov 1, 2024

@milenkovicm updated to be more inline with #1091. In Python, running an init always returns a None, so I just added a build() function to initialize it. For the remote cluster this might come in handy to have it setup this way in the future if we want to do builder pattern similar to Spark. I have a feeling as more features are added to Ballista this may be needed.

@milenkovicm
Copy link
Contributor

great stuff @tbar4, two questions, it would be extra mile if we could do it:

  • can we make standalone optional feature in python as well ?
  • does py3o/python support trait extensions (like in rust), something like
from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext.standalone()

or

from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext.remove("http:// ...")

@milenkovicm
Copy link
Contributor

milenkovicm commented Nov 2, 2024

@andygrove your opinion please, would it make sense to rename pyballista to datafusion_distributed to align name with datafusion_ray ?

@andygrove
Copy link
Member

@andygrove your opinion please, would it make sense to rename pyballista to datafusion_distributed to align name with datafusion_ray ?

I think that we should stick with the Ballista name for anything in the datafusion-ballista project.

btw, I originally chose pyballista because the ballista name was already taken in PyPi, but that project will soon be deleted by the author (it was just a placeholder), so we can just use ballista now.

@andygrove
Copy link
Member

This is a random place to give this feedback, but eventually, I would like to be able to start the executor and scheduler processes from the Python bindings as well. This was possible at one time.

@tbar4
Copy link
Contributor Author

tbar4 commented Nov 2, 2024

This is a random place to give this feedback, but eventually, I would like to be able to start the executor and scheduler processes from the Python bindings as well. This was possible at one time.

You mean not just connect, but also be able to start a an actual schedule/executor from Python? I was thinking about this considering you really can autoscale with Python, or are you meaning something different?

@milenkovicm
Copy link
Contributor

This is a random place to give this feedback, but eventually, I would like to be able to start the executor and scheduler processes from the Python bindings as well. This was possible at one time.

make sense, do we want to take it in this task or capture this as follow up. I was planning to make scheduler and executor process easier to bootstrap

@tbar4
Copy link
Contributor Author

tbar4 commented Nov 2, 2024

This is a random place to give this feedback, but eventually, I would like to be able to start the executor and scheduler processes from the Python bindings as well. This was possible at one time.

make sense, do we want to take it in this task or capture this as follow up. I was planning to make scheduler and executor process easier to bootstrap

Personally, I would like to make it a follow up. I'm deploying a Ballista instance as an alternative to Spark for testing at work and I don't see it going very far if users can use Python to interact with it. Merging this will at least give me a starting point and it will allow me to focus more on improvements with some breathing room while users test out Python.

But I will am also going to look into making "standalone" an additional feature instead of part of the library today.

@milenkovicm
Copy link
Contributor

@tbar4 would you please create task to expose scheduler and executor in python so we can track it?

I'll check if everything is there on rust side once I finish #1104

@milenkovicm
Copy link
Contributor

I had a quick look at the PR

I think you're overcomplicating it a bit,

if we go back to original idea

from Ballista import Builder
ctx : SessionContext = Ballista.builder.config("ballista.job.name","Super Cool Ballista Python Job").standalone()

you just need BallistaBuilder to cover whole flow, no need to introduce SessionConfig or SessionState

struct BallistaBuilder {
confg: HashMap<String,String>
}


impl BallistaBuilder {
    fn conf(self, key: String, value: String) -> Self {
       self.conf.insert(key, value)
       self
   }

   fn standalone(self) -> PySessionContext {
    // create SessionConfig
    // create SessionState
   // create SessionContext
    PySessionContext { ctx }
   }
}

wdyt @tbar4 ?

@tbar4
Copy link
Contributor Author

tbar4 commented Nov 15, 2024

I had a quick look at the PR

I think you're overcomplicating it a bit,

if we go back to original idea

from Ballista import Builder
ctx : SessionContext = Ballista.builder.config("ballista.job.name","Super Cool Ballista Python Job").standalone()

you just need BallistaBuilder to cover whole flow, no need to introduce SessionConfig or SessionState

struct BallistaBuilder {
confg: HashMap<String,String>
}


impl BallistaBuilder {
    fn conf(self, key: String, value: String) -> Self {
       self.conf.insert(key, value)
       self
   }

   fn standalone(self) -> PySessionContext {
    // create SessionConfig
    // create SessionState
   // create SessionContext
    PySessionContext { ctx }
   }
}

wdyt @tbar4 ?

Simple enough, though in a future issue I would like to implement the state and session config to be on par with the Rust implementation options for defining session state and ballista config options

@tbar4
Copy link
Contributor Author

tbar4 commented Nov 15, 2024

@milenkovicm I am using the deprecated BallistaContext, but just to define config and then passing the context created from the deprecated BallistaContext to the Python DataFusion SessionContext. I have tested this out on a 3 node cluster (AWS r5.2XLarge, 1 scheduler and 2 executors) and work is being routed to both executors. The current SessionContextExt doesn't take any config options as an argument and instead takes the state which contains a config.

In a future PR I can make the switch if necessary, but this PR lets people get up an running with Ballista in Python and utilize all the features of DataFusion Python.

@milenkovicm
Copy link
Contributor

@tbar4 I'm waiting for this pr to remove BallistaCo text, please don't use it if possible.

You can create SessionContext like

    let session_config = SessionConfig::new_with_ballista()
            .with_information_schema(true)
            .with_ballista_job_name("Super Cool Ballista App");

        let state = SessionStateBuilder::new()
            .with_default_features()
            .with_config(session_config)
            .build();

        let ctx: SessionContext = SessionContext::remote_with_state(&url, state).await?;

Would that work in your case?

Copy link
Contributor

@milenkovicm milenkovicm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tbar4 we're almost there, just few little changes needed IMHO

@tbar4
Copy link
Contributor Author

tbar4 commented Nov 16, 2024

@milenkovicm @andygrove I am pretty sure this is what we want. We have a way to set config in Python, we are using the new SessionContext to build standalone or remote contexts. Let me know if we want any other changes! Otherwise I'm good for PR approval

Copy link
Contributor

@milenkovicm milenkovicm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're almost there

  • Builder is almost there
  • Documentation and code need some cleanup
  • test is failing as it is using non existing class

@tbar4
Copy link
Contributor Author

tbar4 commented Nov 18, 2024

@milenkovicm @andygrove ready for you all to take another look. Need a maintainer to resolve the merge conflict with python/src/context.rs

@milenkovicm
Copy link
Contributor

I think this makes sense now, we can take further updates at the later date.
One small note, there is one unused import which could be removed.

you'll have to rebase your branch @tbar4 and resolve conflicts.
thanks a lot

@tbar4
Copy link
Contributor Author

tbar4 commented Nov 19, 2024

@andygrove ready to be merged.

@milenkovicm thanks for your patience and help!

@andygrove andygrove merged commit d949e5f into apache:main Nov 19, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ballista Python Update
3 participants